Welcome to Boston, Massachusetts in the 1970s! Imagine you're working for a real estate development company. Your company wants to value any residential project before they start. You are tasked with building a model that can provide a price estimate based on a home's characteristics.
To accomplish your task you will:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
from sklearn.linear_model import LinearRegression

The first column in the .csv file just has the row numbers, so it will be used as the index.
data = pd.read_csv('data/boston.csv', index_col=0)

Characteristics:
This is a copy of the UCI ML housing dataset, which was taken from the StatLib library maintained at Carnegie Mellon University. You can find the original research paper here.
Challenge
data.shape
data.columns
data.isna().sum()
data.duplicated().sum()

0
Challenge
print(f'The average pupil-to-teacher ratio in the data is {data["PTRATIO"].mean():.2f}.')

The average pupil-to-teacher ratio in the data is 18.46.
print(f'The average price of a home in the dataset is ${1000 * data["PRICE"].mean():,.2f}.')

The average price of a home in the dataset is $22,532.81.
What is the CHAS feature?

print(f'It is a variable indicating whether a property is next to the Charles River or not. In the dataset {100 * data[data["CHAS"] == 1].shape[0] / data.shape[0]:.2}% of the properties were next to the Charles River.')

It is a variable indicating whether a property is next to the Charles River or not. In the dataset 6.9% of the properties were next to the Charles River.
print(f'The minimum number of rooms is {data["RM"].min():.1f}.')
print(f'The maximum number of rooms is {data["RM"].max():.1f}.')

The minimum number of rooms is 3.6.
The maximum number of rooms is 8.8.
Challenge: Having looked at some descriptive statistics, visualise the data for your model. Use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE) for the following variables:
sns.displot(data, x = "PRICE")
plt.show()

sns.displot(data, x = "RM")
plt.show()

sns.displot(data, x = "DIS")
plt.show()

sns.displot(data, x = "RAD")
plt.show()

Challenge
Create a bar chart with plotly for CHAS to show how many more homes are away from the river versus next to it. The bar chart should look something like this:
You can make your life easier by providing a list of values for the x-axis (e.g., x=['No', 'Yes'])
close_river = data.groupby("CHAS", as_index = False).size()
close_river["CHAS"] = close_river["CHAS"].astype(str)
fig = px.bar(close_river, x = "CHAS", y = "size", labels={"CHAS": "Close to the Charles River?", "size": "Number of properties"})
fig.update_xaxes(type = "category", labelalias = {"0.0": "Not close", "1.0": "Close"})
fig.show()

Challenge
There might be some relationships in the data that we should know about. Before you run the code, make some predictions:
Run a Seaborn .pairplot() to visualise all the relationships at the same time. Note, this is a big task and can take 1-2 minutes! After it's finished check your intuition regarding the questions above on the pairplot.
sns.pairplot(data)
plt.show()

Challenge
Use Seaborn's .jointplot() to look at some of the relationships in more detail. Create a jointplot for:
sns.jointplot(data, x = "DIS", y = "NOX", alpha = .3)
plt.show()

sns.jointplot(data, x = "INDUS", y = "NOX", alpha = .3)
plt.show()

sns.jointplot(data, x = "LSTAT", y = "RM", alpha = .3)
plt.show()

sns.jointplot(data, x = "LSTAT", y = "PRICE", alpha = .3)
plt.show()

sns.jointplot(data, x = "RM", y = "PRICE", alpha = .3)
plt.show()

Try adding some opacity or alpha to the scatter plots using keyword arguments under joint_kws.
We can't use all 506 entries in our dataset to train our model. The reason is that we want to evaluate our model on data that it hasn't seen yet (i.e., out-of-sample data). That way we can get a better idea of its performance in the real world.
Challenge
Import the train_test_split() function from sklearn:

from sklearn.model_selection import train_test_split

Use random_state=10. This helps us get the same results every time and avoid confusion while we're learning.

Hint: Remember, your target is your home PRICE, and your features are all the other columns you'll use to predict the price.
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns = ["PRICE"]), data["PRICE"], test_size = .2, random_state = 10)

In a previous lesson, we had a linear model with only a single feature (our movie budgets). This time we have a total of 13 features. Therefore, our Linear Regression model will have the following form:
\[ PRICE = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT\]
Challenge
Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data?
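The r-squared can be read straight off the fitted model with its .score() method. Here is a minimal sketch on synthetic data (the names X and y are illustrative stand-ins, not the Boston split):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: y depends linearly on two features plus a little noise
rng = np.random.default_rng(10)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(f"Training r-squared: {model.score(X, y):.3f}")  # close to 1 for this clean data
```

In the notebook, the same call is reg.score(X_train, y_train), which gives about 0.75 on the training data.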
reg = LinearRegression().fit(X_train, y_train)

Here we do a sense check on our regression coefficients. The first thing to look for is if the coefficients have the expected sign (positive or negative).
Challenge Print out the coefficients (the thetas in the equation above) for the features. Hint: You'll see a nice table if you stick the coefficients in a DataFrame.
coefficients_train = pd.DataFrame(zip(X_train.columns, reg.coef_))
coefficients_train.columns = ["Feature", "Coefficient"]
display(Markdown(coefficients_train.to_markdown()))

|    | Feature | Coefficient |
|---|---|---|
| 0 | CRIM | -0.128181 |
| 1 | ZN | 0.0631982 |
| 2 | INDUS | -0.00757628 |
| 3 | CHAS | 1.97451 |
| 4 | NOX | -16.272 |
| 5 | RM | 3.10846 |
| 6 | AGE | 0.0162922 |
| 7 | DIS | -1.48301 |
| 8 | RAD | 0.303988 |
| 9 | TAX | -0.0120821 |
| 10 | PTRATIO | -0.820306 |
| 11 | B | 0.011419 |
| 12 | LSTAT | -0.581626 |
coef_rm = coefficients_train[coefficients_train['Feature'] == 'RM']['Coefficient'].iloc[0]
print(f"The coefficient of the number of rooms in the regression is {coef_rm:.2f}, which is {np.where(coef_rm < 0, 'negative', 'positive')}, and thus aligned with the expectation.")

The coefficient of the number of rooms in the regression is 3.11, which is positive, and thus aligned with the expectation.
coef_lstat = coefficients_train[coefficients_train['Feature'] == 'LSTAT']['Coefficient'].iloc[0]
print(f"The coefficient of the percentage of lower status to the total population in the regression is {coef_lstat:.2f}, which is {np.where(coef_lstat < 0, 'negative', 'positive')}, and thus aligned with the expectation.")

The coefficient of the percentage of lower status to the total population in the regression is -0.58, which is negative, and thus aligned with the expectation.
print(f"For each extra room, a property is ${1000 * coef_rm:,.2f} more expensive, all other factors kept constant.")

For each extra room, a property is $3,108.46 more expensive, all other factors kept constant.
The next step is to evaluate our regression. How good our regression is depends not only on the r-squared. It also depends on the residuals - the difference between the model's predictions (\(\hat{y_i}\)) and the true values (\(y_i\)) inside y_train.
predicted_values = reg.predict(X_train)
residuals = (y_train - predicted_values)

Challenge: Create two scatter plots.
The first plot should be actual values (y_train) against the predicted values:
fig, ax = plt.subplots()
ax.scatter(y_train, predicted_values, color = "blue", alpha = .3)
ax.axline((0, 0), slope = 1, color = "cyan")
ax.set_ylabel("Predicted values")
ax.set_xlabel("Actual prices (training set)")
plt.show()

The cyan line in the middle shows y_train against y_train. If the predictions had been 100% accurate then all the dots would be on this line. The further away the dots are from the line, the worse the prediction was. That makes the distance to the cyan line, you guessed it, our residuals 😊
The second plot should be the residuals against the predicted prices. Here's what we're looking for:
f, ax = plt.subplots()
ax.scatter(predicted_values, residuals, color = "purple", alpha = .3)
ax.set_xlabel("Predicted values")
ax.set_ylabel("Residuals")
plt.show()

Why do we want to look at the residuals? We want to check that they look random. Why? The residuals represent the errors of our model. If there's a pattern in our errors, then our model has a systematic bias.
We can analyse the distribution of the residuals. In particular, we're interested in the skew and the mean.
In an ideal case, what we want is something close to a normal distribution. A normal distribution has a skewness of 0 and a mean of 0. A skew of 0 means that the distribution is symmetrical - the bell curve is not lopsided or biased to one side. Here's what a normal distribution looks like:
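As a quick sanity check on what "skewness of 0" looks like in practice, scipy's skew() on a large sample drawn from a normal distribution comes out very close to zero:

```python
import numpy as np
from scipy.stats import skew

# Draw a large sample from a standard normal distribution
rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=0, scale=1, size=100_000)
print(f"Skewness of a normal sample: {skew(normal_sample):.3f}")  # close to 0
```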
Challenge
from scipy.stats import skew
print(f"The mean of the residuals is {residuals.mean():.2f}, and the skewness is {skew(residuals):.2f}")

The mean of the residuals is 0.00, and the skewness is 1.45
Use .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE).

sns.displot(residuals.to_frame(), x = "PRICE")
plt.show()

We have two options at this point:
Let's try a data transformation approach.
Challenge
Investigate if the target data['PRICE'] could be a suitable candidate for a log transformation.
Use .displot() to show a histogram and KDE of the price data.

sns.displot(data, x = "PRICE")
plt.show()
print(f"The skewness of the Price original data is {skew(data['PRICE']):.2f}.")

The skewness of the Price original data is 1.10.
Use NumPy's log() function to create a Series that has the log prices. Plot it with .displot() and calculate the skew.

data["log_price"] = np.log(data["PRICE"])
sns.displot(data, x = "log_price")
plt.show()
print(f"The skewness of the log-transformed price is {skew(data['log_price']):.2f}.")

The skewness of the log-transformed price is -0.33.
Using a log transformation does not affect every price equally. Large prices are affected more than smaller prices in the dataset. Here's how the prices are "compressed" by the log transformation:
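A quick numeric illustration of the compression, with two arbitrary prices (in $1000s, as in the dataset):

```python
import numpy as np

# A $40k gap on the original scale shrinks to ~1.61 on the log scale
cheap, expensive = 10.0, 50.0
print(expensive - cheap)                  # gap on the original scale: 40.0
print(np.log(expensive) - np.log(cheap))  # gap on the log scale: ~1.61
```

On the log scale only the ratio matters: any pair of prices with a 5-to-1 ratio ends up the same distance apart.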
We can see this when we plot the actual prices against the (transformed) log prices.
f, ax = plt.subplots()
ax.scatter(data["PRICE"], data["log_price"], alpha = .3)
ax.set_ylabel("Log-transformed price")
ax.set_xlabel("Price")
plt.show()

Using log prices instead, our model has changed to:
\[ \log{PRICE} = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT \]
Challenge:
Use train_test_split() with the same random state as before to make the results comparable.

X_log_train, X_log_test, y_log_train, y_log_test = train_test_split(data.drop(columns = ["PRICE", "log_price"]), data["log_price"], test_size = .2, random_state = 10)
reg_log = LinearRegression().fit(X_log_train, y_log_train)

Challenge: Print out the coefficients of the new regression model.
coefficients_log_train = pd.DataFrame(zip(X_log_train.columns, reg_log.coef_))
coefficients_log_train.columns = ["Feature", "Coefficient"]
display(Markdown(coefficients_log_train.to_markdown()))

|    | Feature | Coefficient |
|---|---|---|
| 0 | CRIM | -0.0106717 |
| 1 | ZN | 0.00157929 |
| 2 | INDUS | 0.0020299 |
| 3 | CHAS | 0.0803305 |
| 4 | NOX | -0.704068 |
| 5 | RM | 0.0734044 |
| 6 | AGE | 0.000763302 |
| 7 | DIS | -0.0476333 |
| 8 | RAD | 0.0145651 |
| 9 | TAX | -0.000644998 |
| 10 | PTRATIO | -0.0347948 |
| 11 | B | 0.000515896 |
| 12 | LSTAT | -0.0313901 |
Hint: Use a DataFrame to make the output look pretty.
Challenge:
Use indigo as the colour for the original regression and navy for the regression using log prices.

predicted_log_values = reg_log.predict(X_log_train)
residuals_log = (y_log_train - predicted_log_values)
fig, ax = plt.subplots()
ax.scatter(y_log_train, predicted_log_values, color = "blue", alpha = .3)
ax.axline((1.5, 1.5), slope = 1, color = "cyan")
ax.set_ylabel("Predicted log values")
ax.set_xlabel("Actual log prices (training set)")
plt.show()
f, ax = plt.subplots()
ax.scatter(predicted_log_values, residuals_log, color = "purple", alpha = .3)
ax.set_xlabel("Predicted log values")
ax.set_ylabel("Residuals (log)")
plt.show()
print(f"The mean of the residuals_log is {residuals_log.mean():.2f}, and the skewness is {skew(residuals_log):.2f}")

The mean of the residuals_log is -0.00, and the skewness is 0.09

sns.displot(residuals_log.to_frame(), x = "log_price")
plt.show()
Challenge:
Calculate the mean and the skew for the residuals using log prices. Are the mean and skew closer to 0 for the regression using log prices?
The real test is how our model performs on data that it has not "seen" yet. This is where our X_test comes in.
Challenge
Compare the r-squared of the two models on the test dataset. Which model does better? Is the r-squared higher or lower than for the training dataset? Why?
from sklearn.metrics import r2_score
predicted_test_normal = reg.predict(X_test)
predicted_test_log = reg_log.predict(X_log_test)
r2_training_normal = r2_score(y_train, predicted_values)
r2_training_log = r2_score(y_log_train, predicted_log_values)
r2_test_normal = r2_score(y_test, predicted_test_normal)
r2_test_log = r2_score(y_log_test, predicted_test_log)
df_r2 = pd.DataFrame({"Type": ["Training set, normal", "Training set, log", "Test set, normal", "Test set, log"], "R2": [r2_training_normal, r2_training_log, r2_test_normal, r2_test_log]})
display(Markdown(df_r2.to_markdown(index = False)))

| Type | R2 |
|---|---|
| Training set, normal | 0.750122 |
| Training set, log | 0.793023 |
| Test set, normal | 0.670934 |
| Test set, log | 0.744692 |
Our preferred model now has an equation that looks like this:
\[ \log{PRICE} = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT \]
The average property has the mean value for all its characteristics:
features = data.drop(['PRICE', "log_price"], axis=1)
average_vals = features.mean().values
property_stats = pd.DataFrame(data=average_vals.reshape(1, len(features.columns)),
columns=features.columns)
display(Markdown(property_stats.T.to_markdown()))

|    | 0 |
|---|---|
| CRIM | 3.61352 |
| ZN | 11.3636 |
| INDUS | 11.1368 |
| CHAS | 0.06917 |
| NOX | 0.554695 |
| RM | 6.28463 |
| AGE | 68.5749 |
| DIS | 3.79504 |
| RAD | 9.54941 |
| TAX | 408.237 |
| PTRATIO | 18.4555 |
| B | 356.674 |
| LSTAT | 12.6531 |
Challenge
Predict how much the average property is worth using the stats above. What is the log price estimate and what is the dollar estimate? You'll have to reverse the log transformation with .exp() to find the dollar value.
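A sketch of the reverse transformation. The log estimate of 3.03 below is a hypothetical placeholder; in the notebook it would come from reg_log.predict(property_stats):

```python
import numpy as np

# Hypothetical model output: the predicted log of the price (in $1000s).
# In the notebook: log_estimate = reg_log.predict(property_stats)[0]
log_estimate = 3.03
dollar_estimate = np.exp(log_estimate) * 1000  # undo the log, scale to dollars
print(f"Estimated value: ${dollar_estimate:,.2f}")
```

Because the model predicts log prices, forgetting the np.exp() step would give a number in log units rather than dollars.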
Challenge
Keeping the average values for CRIM, RAD, INDUS and others, value a property with the following characteristics:
x_particular = property_stats.copy()
x_particular["CHAS"] = 1
x_particular["RM"] = 8
x_particular["PTRATIO"] = 20
x_particular["DIS"] = 5
x_particular["NOX"] = data["NOX"].quantile(.75) # high
x_particular["LSTAT"] = data["LSTAT"].quantile(.25) # low
y_particular = reg_log.predict(x_particular)
print(f"The property has a predicted value of ${np.exp(y_particular)[0] * 1000:,.2f}.")

The property has a predicted value of $25,792.03.